Science fiction has long imagined a future populated with artificial
humans: human-looking devices with human-like intelligence. Although
Asimov’s benevolent robots and the Terminator movies’ terrible war
machines are still a distant fantasy, researchers across a wide range
of disciplines are beginning to work together toward a more modest
goal: building virtual humans. These software entities look and act
like people and can engage in conversation and collaborative tasks,
but they live in simulated environments. With the untidy problems of
sensing and acting in the physical world thus dispensed with, the
focus of virtual human research is on capturing the richness and
dynamics of human behavior.

The potential applications of this technology are considerable.
History students could visit ancient Greece and debate Aristotle.
Patients with social phobias could rehearse threatening social
situations in the safety of a virtual environment. Social
psychologists could study theories of communication by systematically
modifying a virtual human’s verbal and nonverbal behavior. A variety
of applications are already in progress, including education and
training, therapy, marketing, and entertainment.

Building a virtual human is a multidisciplinary effort, joining
traditional artificial intelligence problems with a range of issues
from computer graphics to social science. Virtual humans must act and
react in their simulated environment, drawing on the disciplines of
automated reasoning and planning. To hold a conversation, they must
exploit the full gamut of natural language processing research, from
speech recognition and natural language understanding to natural
language generation and speech synthesis. Providing human bodies that
can be controlled in real time delves into computer graphics and
animation. And because an agent looks like a human, people expect it
to behave like one as well and will be disturbed by, or misinterpret,
discrepancies from human norms. Thus, virtual human research must
draw heavily on psychology and communication theory to appropriately
convey nonverbal behavior, emotion, and personality.

This broad range of requirements poses a serious problem. Researchers
working on particular aspects of virtual humans cannot explore their
component in the context of a complete virtual human unless they can
understand results across this array of disciplines and assemble the
vast range of software tools (for example, speech recognizers,
planners, and animation systems) required to construct one. Moreover,
these tools were rarely designed to interoperate and, worse, were
often designed with different purposes in mind. For example, most
computer graphics research has focused on high-fidelity offline image
rendering that does not support the fine-grained interactive control
that a virtual human must have over its body.

In the spring of 2002, about 30 international researchers from across
disciplines convened at the University of Southern California to
begin to bridge this gap in knowledge and tools (see
www.ict.usc.edu/~vhumans). Our ultimate goal is a modular
architecture and interface standards that will allow researchers in
this area to reuse each other’s work. This goal can only be achieved
through a close multidisciplinary collaboration. Towards this end,
the workshop gathered a collection of experts representing the range
of required research areas, including

•  Human  figure  animation 

•  Facial  animation 

•  Perception 

•  Cognitive  modeling 

•  Emotions  and  personality 

•  Natural  language  processing 

•  Speech  recognition  and  synthesis 

•  Nonverbal  communication 

•  Distributed  simulation 

•  Computer  games 
Here we discuss some of the key issues that must be addressed in
creating virtual humans. As a first step, we survey the issues and
available tools in three key areas of virtual human research:
face-to-face conversation, emotions and personality, and human figure
animation.

Face-to-face  conversation 

Human face-to-face conversation involves both language and nonverbal
behavior. The behaviors during conversation don’t just function in
parallel, but interdependently. The meaning of a word informs the
interpretation of a gesture, and vice versa. The time scales of these
behaviors, however, are different: a quick look at the other person
to check that they are listening lasts for less time than it takes to
pronounce a single word, while a hand gesture that indicates what the
word “caulk” means might last longer than it takes to say, “I caulked
all weekend.”

Coordinating verbal and nonverbal conversational behaviors for
virtual humans requires meeting several interrelated challenges. How
speech, intonation, gaze, and head movements make meaning together,
the patterns of their co-occurrence in conversation, and what kinds
of goals are achieved by the different channels are all equally
important for understanding the construction of virtual humans.
Speech and nonverbal behaviors do not always manifest the same
information, but what they convey is virtually always compatible. In
many cases, different modalities serve to reinforce one another
through redundancy of meaning. In other cases, semantic and pragmatic
attributes of the message are distributed across the modalities. The
compatibility of meaning between gestures and speech recalls the
interaction of words and graphics in multimodal presentations. For
patterns of co-occurrence, there is a tight synchrony among the
different conversational modalities in humans. For example, people
accentuate important words by speaking more forcefully, illustrating
their point with a gesture, and turning their eyes toward the
listener when coming to the end of a thought. Meanwhile, listeners
nod within a few hundred milliseconds of when the speaker’s gaze
shifts. This synchrony is essential to the meaning of conversation.
When it is destroyed, as in low-bandwidth videoconferencing,
satisfaction and trust in the outcome of a conversation diminish.


Regarding the goals achieved by the different modalities, in natural
conversation speakers tend to produce a gesture with respect to their
propositional goals (to advance the conversation content), such as
making the first two fingers look like legs walking when saying “it
took 15 minutes to get here,” and speakers tend to use eye movement
with respect to interactional goals (to ease the conversation
process), such as looking toward the other person when giving up the
turn. To realistically generate all the different verbal and
nonverbal behaviors, then, computational architectures for virtual
humans must control both the propositional and interactional
structures. In addition, because some of these goals can be equally
well met by one modality or the other, the architecture must deal at
the level of goals or functions, and not at the level of modalities
or behaviors. That is, giving up the turn is often achieved by
looking at the listener. But, if the speaker’s eyes are on the road,
he or she can get a response by saying, “Don’t you think?”

Constructing a virtual human that can effectively participate in
face-to-face conversation requires a control architecture with the
following features:

• Multimodal input and output. Because humans in face-to-face
conversation send and receive information through gesture,
intonation, and gaze as well as speech, the architecture should also
support receiving and transmitting this information.

• Real-time feedback. The system must let the speaker watch for
feedback and turn requests, while the listener can send these at any
time through various modalities. The architecture should be flexible
enough to track these different threads of communication in ways
appropriate to each thread. Different threads have different
response-time requirements; some, such as feedback and interruption,
occur on a sub-second time scale. The architecture should reflect
this by allowing different processes to concentrate on activities at
different time scales.

• Understanding and synthesis of propositional and interactional
information. Dealing with propositional information (the
communication content) requires building a model of the user’s needs
and knowledge. The architecture must include a static domain
knowledge base and a dynamic discourse knowledge base. Presenting
propositional information requires a planning module for presenting
multi-sentence output and managing the order of presentation of
interdependent facts. Understanding interactional information (about
the processes of conversation), on the other hand, entails building a
model of the current state of the conversation with respect to the
conversational process (to determine who is the current speaker and
listener, whether the listener has understood the speaker’s
contribution, and so on).

• Conversational function model. Functions, such as initiating a
conversation or giving up the floor, can be achieved by a range of
different behaviors, such as looking repeatedly at another person or
bringing your hands down to your lap. Explicitly representing
conversational functions, rather than behaviors, provides both
modularity and a principled way to combine different modalities.
Functional models influence the architecture because the core system
modules operate exclusively on functions, while other system modules
at the edges translate input behaviors into functions, and functions
into output behaviors. This also produces a symmetric architecture
because the same functions and modalities are present in both input
and output.
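
The function-centered design in the last item can be illustrated with a small sketch. It is not taken from any system discussed here: the behavior names, function names, and rule tables are hypothetical, and a real architecture would use far richer context. The point is only that edge modules translate behaviors to functions and back, so the core never reasons about modalities directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behavior:
    modality: str  # e.g. "gaze", "speech", "gesture"
    action: str    # hypothetical behavior label


# Hypothetical input rules: observed behavior -> conversational function.
INPUT_RULES = {
    Behavior("gaze", "look_at_listener"): "give_turn",
    Behavior("speech", "tag_question"): "give_turn",
    Behavior("gesture", "hands_to_lap"): "give_turn",
    Behavior("gaze", "look_repeatedly_at_person"): "initiate_conversation",
}

# Hypothetical output rules: function -> candidate behaviors, in preference order.
OUTPUT_RULES = {
    "give_turn": [
        Behavior("gaze", "look_at_listener"),
        Behavior("speech", "tag_question"),
    ],
    "initiate_conversation": [
        Behavior("gaze", "look_repeatedly_at_person"),
    ],
}


def interpret(behavior):
    """Input edge: map an observed behavior to a function (None if unknown)."""
    return INPUT_RULES.get(behavior)


def realize(function, busy_modalities=frozenset()):
    """Output edge: pick the first candidate whose modality is free."""
    for candidate in OUTPUT_RULES.get(function, []):
        if candidate.modality not in busy_modalities:
            return candidate
    return None


# Eyes on the road: gaze is busy, so the turn is given up verbally instead.
print(realize("give_turn", busy_modalities={"gaze"}))
```

Because the same function vocabulary appears in both tables, the sketch is symmetric in the sense described above: a behavior the system can produce for "give_turn" is also one it can recognize.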

To capture different time scales and the importance of co-occurrence,
input to a virtual human must be incremental and time stamped. For
example, incremental speech recognition lets the virtual human give
feedback (such as a quick nod) right as the real human finishes a
sentence, therefore influencing the direction the human speaker
takes. At the very least, the system should report a significant
change in state right away, even if full information about the event
has not yet been processed. This means that if speech recognition
cannot be incremental, at least the fact that someone has started or
finished speaking should be relayed immediately, even in the absence
of a fully recognized utterance. This lets the virtual human give up
the turn when the real human claims it and signal reception after
being addressed. When dealing with multiple modalities, fusing
interpretations of the different input events is important to
understand what behaviors are acting together to convey meaning. For
this, a synchronized clock across modalities is crucial so events
such as exactly when an emphasis beat gesture occurs can be compared
to speech, word by word. This requires, of course, that the speech
recognizer supply word onset times.

Figure 1. Behavior Expression Animation Toolkit text-to-nonverbal behavior module.
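
As a minimal sketch of the word-by-word comparison just described, the following assumes that the speech recognizer and the gesture tracker stamp events on one shared clock, in seconds. The `Word` type and the example timings are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    onset: float  # seconds on the shared clock (assumed to be supplied)


def word_at(words, t):
    """Return the word whose onset most recently precedes time t.

    `words` must be sorted by onset. Returns None if t is before the
    first word starts.
    """
    onsets = [w.onset for w in words]
    i = bisect_right(onsets, t) - 1
    return words[i] if i >= 0 else None


# Hypothetical recognizer output for "I caulked all weekend".
words = [
    Word("I", 0.00),
    Word("caulked", 0.18),
    Word("all", 0.70),
    Word("weekend", 0.92),
]

beat_time = 0.25  # hypothetical time stamp of an emphasis beat gesture
print(word_at(words, beat_time).text)
```

Without word onset times from the recognizer, the lookup has nothing to index into, which is why the requirement above falls on the speech component rather than on the fusion step.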

Similarly, for the virtual human to produce a multimodal performance,
the output channels also must be incremental and tightly
synchronized. Incremental refers to two properties in particular:
seamless transitions and interruptible behavior. When producing
certain behaviors, such as gestures, the virtual human must
reconfigure its limbs in a natural manner, usually requiring that
some time be spent on interpolating from a previous posture to a new
one. For the transition to be seamless, the virtual human must give
the animation system advance notice of events such as gestures, so
that it has time to bring the arms into place. Sometimes, however,
behaviors must be abruptly interrupted, such as when the real human
takes the turn before the virtual human has finished speaking. In
that case, the current behavior schedule must be scrapped, the voice
halted, and new attentive behaviors initiated, all with reasonable
seamlessness.

Synchronicity between modalities is as important in the output as the
input. The virtual human must align a graphical behavior with the
uttering of particular words or a group of words. The temporal
association between the words and behaviors might have been resolved
as part of the behavior generation process, as is done in SPUD
(Sentence Planning Using Description), but it is essential that the
speech synthesizer provide a mechanism for maintaining synchrony
through the final production stage. There are two types of mechanism:
event based and time based. A text-to-speech (TTS) engine can usually
be programmed to send events on phoneme and word boundaries. Although
this is geared towards supporting lip synch, other behaviors can be
executed as well. However, this does not allow any time for behavior
preparation. Preferably, the TTS engine can provide exact start times
for each word prior to playing back the voice, as Festival does. This
way, we can schedule the behaviors, and thus the transitions between
behaviors, beforehand, and then play them back along with the voice
for a perfectly seamless performance.
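
The time-based mechanism can be sketched as follows, assuming the word start times are already available as a plain list of seconds. How they are obtained from a particular TTS engine is not shown, and the `schedule` function, the behavior names, and the fixed preparation time are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScheduledBehavior:
    name: str
    prep_start: float  # when the animation system should start moving into place
    stroke: float      # when the behavior must coincide with its word


def schedule(word_starts, annotations, prep_time=0.25):
    """Build a behavior schedule before playback begins.

    word_starts: start time in seconds of each word, by word index.
    annotations: (behavior name, word index) pairs from behavior generation.
    prep_time:   assumed lead time needed to reach the starting posture.
    """
    plan = []
    for name, word_index in annotations:
        stroke = word_starts[word_index]
        prep_start = max(0.0, stroke - prep_time)  # cannot start before playback
        plan.append(ScheduledBehavior(name, prep_start, stroke))
    return sorted(plan, key=lambda b: b.prep_start)


# Hypothetical timings for a four-word utterance.
word_starts = [0.0, 0.125, 0.75, 1.0]
annotations = [("gaze_at_listener", 3), ("beat_gesture", 1)]

for behavior in schedule(word_starts, annotations):
    print(behavior)
```

An event-based engine would only announce each word boundary as it is reached, so `prep_start` could never precede `stroke`; knowing the times in advance is what makes the lead time possible.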

On the output side, one tool that provides such tight synchronicity
is the Behavior Expression Animation Toolkit (BEAT) system. Figure 1
shows BEAT’s architecture. BEAT has the advantage of automatically
annotating text with hand gestures, eye gaze, eyebrow movement, and
intonation. The annotation is carried out in XML, through interaction
with an embedded word ontology module, which creates a set of
hypernyms that broadens a knowledge base search of the domain being
discussed. The annotation is then passed to a set of behavior
generation rules. Output is scheduled so that tight synchronization
is maintained among modalities.

Emotions  and  personality 

People infuse their verbal and nonverbal behavior with emotion and
personality, and modeling such behavior is essential for building
believable virtual humans. Consequently, researchers have developed
computational models for a wide range of applications. Computational
approaches might be roughly divided into communication-driven and
simulation-based approaches.

In communication-driven approaches, a virtual human chooses its
emotional expression on the basis of its desired impact on the user.
Catherine Pelachaud and her colleagues use facial expressions to
convey affect in combination with other communicative functions. For
example, making a request with a sorrowful face can evoke pity and
motivate an affirmative response from the listener. An interesting
feature of their approach is that the agent deliberately plans
whether or not to convey a certain emotion. Tutoring applications
usually also follow a communication-driven approach, intentionally
expressing emotions with the goal of motivating the students and thus
increasing the learning effect. The Cosmo system, where the agent’s
pedagogical goals drive the selection and sequencing of emotive
behaviors, is one example. For instance, a congratulatory act
triggers a motivational goal to express admiration that is conveyed
with applause. To convey appropriate emotive behaviors, agents such
as Cosmo need to appraise events not only from their own perspective
but also from the perspective of others.

The second category of approaches aims at a simulation of “true”
emotion (as opposed to deliberately conveyed emotion). These
approaches build on appraisal theories of emotion, the most prominent
being Andrew Ortony, Gerald Clore, and Allan Collins’ cognitive
appraisal theory, commonly referred to as the OCC model. This theory
views emotions as arising from a valenced reaction to events and
objects in the light of agent goals, standards, and attitudes. For
example, an agent watching a game-winning move should respond
differently depending on which team is preferred.
Figure 2. Pelachaud and colleagues use an MPEG-4-compatible facial animation system
to investigate how to resolve conflicts that arise when different communication functions
need to be shown on different channels of the face.


Recent work by Stacy Marsella and Jonathan Gratch integrates the OCC
model with coping theories that explain how people cope with strong
emotions. For example, their agents can engage in either
problem-focused coping strategies, selecting and executing actions in
the world that could improve the agent’s emotional state, or
emotion-focused coping strategies, improving emotional state by
altering the agent’s mental state (for example, dealing with guilt by
blaming someone else). Further simulation approaches are based on the
observation that an agent should be able to dynamically adapt its
emotions through its own experience, using learning mechanisms.

Appraisal theories focus on the relationship between an agent’s world
assessment and the resulting emotions. Nevertheless, they are rather
vague about the assessment process. For instance, they do not explain
how to determine whether a certain event is desirable. A promising
line of research is integrating appraisal theories with AI-based
planning approaches, which might lead to a concretization of such
theories. First, emotions can arise in response to a deliberative
planning process (when relevant risks are noticed, progress assessed,
and success detected). For example, several approaches derive an
emotion’s intensity from the importance of a goal and its probability
of achievement. Second, emotions can influence decision-making by
allocating cognitive resources to specific goals or threats.
Plan-based approaches support the implementation of decision and
action selection mechanisms that are guided by an agent’s emotional
state. For example, the Inhabited Market Place application treats
emotions as filters to constrain the decision process when selecting
and instantiating dialogue operators.
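
To make the first point concrete, here is one simple way an intensity could be derived from a goal’s importance and its probability of achievement. The product form, the choice of “hope” and “fear” as the prospect-based emotions, and the function name are illustrative assumptions, not the formula of any particular system cited here.

```python
def appraise_goal(importance, p_success):
    """Return illustrative intensities for two prospect-based emotions.

    importance: how much the agent cares about the goal (non-negative).
    p_success:  the planner's current estimate that the goal will be achieved.

    Assumption for this sketch: hope scales with the chance of success
    and fear with the chance of failure, both weighted by importance.
    """
    if importance < 0:
        raise ValueError("importance must be non-negative")
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be a probability")
    return {
        "hope": importance * p_success,
        "fear": importance * (1.0 - p_success),
    }


# As the planner notices a new risk, the estimate drops and fear grows.
print(appraise_goal(importance=1.0, p_success=0.75))
print(appraise_goal(importance=1.0, p_success=0.25))
```

Tying the probability to the planner’s own estimate is what grounds the otherwise vague appraisal step: the desirability of an event is judged by how it changes the prospects of the agent’s goals.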

In addition to generating affective states, we must also express them
in a manner easily interpretable to the user. Effective means of
conveying emotions include body gestures, acoustic realization, and
facial expressions (see Gary Collier’s work for an overview of
studies on emotive expressions). Several researchers use Bayesian
networks to model the relationship between emotion and its behavioral
expression. Bayesian networks let us deal explicitly with
uncertainty, which is a great advantage when modeling the connections
between emotions and the resulting behaviors. Gene Ball and Jack
Breese presented an example of such an approach. They constructed a
Bayesian network that estimates the likelihood of specific body
postures and gestures for individuals with different personality
types and emotions. For instance, a negative emotion increases the
probability that an agent will say “Oh, you again,” as opposed to
“Nice to see you!”
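
The greeting example can be written as a two-node network, emotional valence pointing to utterance, evaluated by hand. The probabilities below are invented for illustration and are not taken from Ball and Breese’s model, which also includes personality and many more behavior variables.

```python
# Prior over the hidden emotional state (invented numbers).
P_VALENCE = {"positive": 0.5, "negative": 0.5}

# Conditional probability table P(utterance | valence) (invented numbers).
P_UTTERANCE = {
    "positive": {"Nice to see you!": 0.75, "Oh, you again.": 0.25},
    "negative": {"Nice to see you!": 0.25, "Oh, you again.": 0.75},
}


def utterance_distribution(valence):
    """Generation direction: how likely each greeting is for a given emotion."""
    return dict(P_UTTERANCE[valence])


def posterior_valence(utterance):
    """Recognition direction: Bayes' rule over the same table."""
    joint = {v: P_VALENCE[v] * P_UTTERANCE[v][utterance] for v in P_VALENCE}
    total = sum(joint.values())
    return {v: p / total for v, p in joint.items()}


print(utterance_distribution("negative"))
print(posterior_valence("Oh, you again."))
```

The same table serves both directions, which is one practical attraction of the Bayesian formulation: a model built to generate behavior can also be used to infer a likely emotional state from observed behavior.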

Recent work by Catherine Pelachaud and colleagues employs Bayesian
networks to resolve conflicts that occur when different communicative
functions need to be shown on different channels of the face, such as
eyebrows, mouth shape, gaze direction, head direction, and head
movements (see Figure 2). In this case, the Bayesian network
estimates the likelihood that one face movement overrides another.
Bayesian networks also make it possible to model how emotions vary
over time. Even though neither Ball and Breese nor Pelachaud and
colleagues took advantage of this feature, the extension of the two
approaches to dynamic Bayesian networks seems obvious.

While significant progress has been made on the visualization of
emotive behaviors, automated speech synthesis still has a long way to
go. The most natural-sounding approaches rely on a large inventory of
human speech units (for example, combinations of phonemes) that are
subsequently selected and combined based on the sentence to be
synthesized. These approaches do not yet provide much ability to
convey emotion through speech (for example, by varying prosody or
intensity). Marc Schröder provides an overview of speech
manipulations that have been successfully employed to express several
basic emotions. While the interest in affective speech synthesis is
increasing, hardly any work has been done on conveying emotion
through sentence structure or word choice. One exception is Eduard
Hovy’s pioneering work on natural language generation that addresses
not only the goal of information delivery, but also pragmatic
aspects, such as the speaker’s emotions. Marilyn Walker and
colleagues present a first approach to integrating acoustic
parameters with other linguistic phenomena, such as sentence
structure and wording.

Obviously, there is a close relationship between emotion and
personality. Dave Moffat differentiates between personality and
emotion using the two dimensions duration and focus. Whereas
personality remains stable over a long period of time, emotions are
short-lived. Moreover, while emotions focus on particular events or
objects, factors determining personality are more diffuse and
indirect. Because of this obvious relationship, several projects aim
to develop an integrated model of emotion and personality. As an
example, Ball and Breese model dependencies between emotions and
personality in a Bayesian network. To enhance the believability of
animated agents beyond reasoning about emotion and personality,
Helmut Prendinger and colleagues model the relationship between an
agent’s social role and the associated constraints on emotion
expression, for example, by suppressing negative emotion when
interacting with higher-status individuals.

Another line of research aims at providing an enabling technology to
support affective interactions. This includes both the definition of
standardized languages for specifying emotive behaviors, such as the
Affective Presentation Markup Language or the Emotion Markup Language
(www.vhml.org), and the implementation of toolkits for affective
computing combining a set of components addressing affective
knowledge acquisition, representation, reasoning, planning,
communication, and expression.

Figure 3. PeopleShop and DI-Guy are used to create scenarios for ground combat
training. This scenario was used at Ft. Benning to enhance situation awareness in
experiments to train US Army officers for urban combat. Image courtesy of Boston
Dynamics.

Human  figure  animation 

By engaging in face-to-face conversation, conveying emotion and
personality, and otherwise interacting with the synthetic
environment, virtual humans impose fairly severe behavioral
requirements on the underlying animation system that must render
their physical bodies. Most production work involves animator effort
to design or script movements or direct performer motion capture.
Replaying movements in real time is not the issue; rather, it is
creating novel, contextually sensitive movements in real time that
matters. Interactive and conversational agents, for example, will not
enjoy the luxury of relying on animators to create human time-frame
responses. Animation techniques must span a variety of body systems:
locomotion, manual gestures, hand movements, body pose, faces, eyes,
speech, and other physiological necessities such as breathing,
blinking, and perspiring. Research in human figure animation has
addressed all of these modalities, but historically the work focuses
either on the animation of complete body movements or on animation of
the face.

Body  animation  methods 

In body animation, there are two basic ways to gain the required
interactivity: use motion capture and additional techniques to
rapidly modify or re-target movements to immediate needs, or write
procedural code that allows program control over important movement
parameters. The difficulty with the motion capture approach is
maintaining environmental constraints such as solid foot contacts and
proper reach, grasp, and observation interactions with the agent’s
own body parts and other objects. To alleviate these problems,
procedural approaches parameterize target locations, motion
qualities, and other movement constraints to form a plausible
movement directly. Procedural approaches consist of kinematic and
dynamic techniques. Each has its preferred domain of applicability:
kinematics is generally better for goal-directed activities and
slower (controlled) actions, and dynamics is more natural for
movements directed by application of forces, impacts, or high-speed
behaviors. The wide range of human movement demands that both
approaches have real-time implementations that can be procedurally
selected as required.


Animating a human body form requires more than just controlling
skeletal rotation angles. People are neither skeletons nor robots,
and considerable human qualities arise from intelligent movement
strategies, soft deformable surfaces, and clothing. Movement
strategies include reach or constrained contacts, often achieved with
goal-directed inverse kinematics. Complex workplaces, however, entail
more complex planning to avoid collisions, find free paths, and
optimize strength availability. The suppleness of human skin and the
underlying tissue biomechanics lead to shape changes caused by
internal muscle actions as well as external contact with the
environment. Modeling and animating the local, muscle-based
deformation of body surfaces in real time is possible through shape
morphing techniques, but providing appropriate shape changes in
response to external forces is a challenging problem. “Skin-tight”
texture-mapped clothing is prevalent in computer game characters, but
animating draped or flowing garments requires dynamic simulation,
fast collision detection, and appropriate collision response.
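
Goal-directed inverse kinematics, mentioned above for reaching, can be illustrated with the textbook closed-form solution for a planar two-link arm. This is a deliberately minimal sketch: the planar shoulder-elbow model and the segment lengths are assumptions, and a full human figure needs many more joints, joint limits, and usually a numerical solver.

```python
import math


def two_link_ik(x, y, upper, lower):
    """Return (shoulder, elbow) angles in radians that place the hand at (x, y).

    upper, lower: segment lengths of a planar arm rooted at the origin.
    Only one of the two mirror-image elbow solutions is returned.
    """
    d2 = x * x + y * y
    cos_elbow = (d2 - upper * upper - lower * lower) / (2.0 * upper * lower)
    if abs(cos_elbow) > 1.0:
        raise ValueError("target is out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        lower * math.sin(elbow), upper + lower * math.cos(elbow)
    )
    return shoulder, elbow


def forward(shoulder, elbow, upper, lower):
    """Forward kinematics, used here to check the solution."""
    x = upper * math.cos(shoulder) + lower * math.cos(shoulder + elbow)
    y = upper * math.sin(shoulder) + lower * math.sin(shoulder + elbow)
    return x, y


# Hypothetical reach target, in meters, for 0.3 m upper arm and forearm.
shoulder, elbow = two_link_ik(0.3, 0.4, 0.3, 0.3)
print(forward(shoulder, elbow, 0.3, 0.3))
```

The caller specifies only where the hand should go; the joint angles follow from the goal. That is the sense in which kinematic methods suit goal-directed, controlled actions, while collision avoidance and strength limits require the additional planning noted above.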

Accordingly, animation systems build procedural models of these
various behaviors and execute them on human models. The diversity of
body movements involved has led to building more consistent agents:
procedural animations that affect and control multiple body
communication channels in coordinated ways. The particular challenge
here is constructing computer graphics human models that balance
sufficient articulation, detail, and motion generators to effect both
gross and subtle movements with realism, real-time responsiveness,
and visual acceptability. And if that isn’t enough, consider the
additional difficulty of modeling a specific real individual.
Computer graphics still lacks effective techniques to transfer even
captured motion into features that characterize a specific person’s
mannerisms and behaviors, though machine-learning approaches could
prove promising.

Implementing an animated human body is complicated by a relative
paucity of generally available tools. Body models tend to be
proprietary (for example, Extempo.com, Ananova.com), optimized for
real time and thus limited in body structure and features (for
example, DI-Guy, BDI.com, illustrated in Figure 3), or constructions
for particular animations built with standard animator tools such as
Poser, Maya, or 3DSMax. The best attempt to design a transportable,
standard avatar is the Web3D Consortium’s H-Anim effort
(www.h-anim.org). With well-defined body structure and feature sites,
the H-Anim specification has engendered model sharing and testing not
possible with proprietary approaches. The present liability is the
lack of an application programming interface in the VRML language
binding of H-Anim. A general API for human models is a highly
desirable next step, the benefits of which have been demonstrated by
Norman Badler’s research group’s use of the software API in Jack
(www.ugs.com/products/efactory/jack), which allows feature access and
provides plug-in extensions for new real-time behaviors.

Face  animation  methods 

A computer-animated human face can evoke a wide range of emotions in
real people because faces are central to human reality. Unfortunately,
modeling and rendering artifacts can easily produce a negative response
in the viewer. The great complexity and psychological depth of the human
response to faces make it difficult to predict the response to a given
animated face model. A partial or minimalist rendering of a face can be
pleasing as long as it maintains quality and accuracy in certain key
dimensions. The ultimate goal is to analyze and synthesize humans with
enough fidelity and control to pass the Turing test, create any kind of
virtual being, and enable total control over its virtual appearance.
Eventually, surviving technologies will be combined to increase the
accuracy and efficiency of the capture, linguistic, and rendering
systems. Currently, the approaches to animating the face are disjoint
and driven by production costs and imperfect technology. Each method
presents a distinct “look and feel,” as well as advantages and
disadvantages.

Facial animation methods fall into three major categories. The first and
earliest method is to manually generate keyframes and then fill in the
frames between them, either by automatic interpolation or by handing the
in-between frames to less skilled animators. This approach is used in
traditional cel animation and in 3D animated feature films. Keyframe and
morph target animation provides complete artistic control but can be
time consuming to perfect.
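
As an illustration of the automatic interpolation step, the following is a minimal sketch that assumes each keyframe is a time paired with a dictionary of control values. The control names (`jaw_open`, `smile`) and the choice of linear blending are assumptions made for this example; production tools typically offer smoother spline interpolation.

```python
from bisect import bisect_right

def interpolate_keyframes(keyframes, t):
    """Linearly interpolate a pose at time t from sorted (time, pose) keyframes.

    Each pose is a dict mapping a (hypothetical) control name to a float.
    Times outside the keyframe range clamp to the first or last pose.
    """
    times = [time for time, _ in keyframes]
    if t <= times[0]:
        return dict(keyframes[0][1])
    if t >= times[-1]:
        return dict(keyframes[-1][1])
    i = bisect_right(times, t)           # index of the first keyframe after t
    (t0, p0), (t1, p1) = keyframes[i - 1], keyframes[i]
    w = (t - t0) / (t1 - t0)             # 0 at t0, approaching 1 at t1
    return {name: (1 - w) * p0[name] + w * p1[name] for name in p0}

# Three hand-authored keyframes; every in-between frame is computed.
keys = [
    (0.0, {"jaw_open": 0.0, "smile": 0.0}),
    (1.0, {"jaw_open": 1.0, "smile": 0.5}),
    (2.0, {"jaw_open": 0.0, "smile": 1.0}),
]
mid = interpolate_keyframes(keys, 0.5)
```

The animator keeps full control over the keyframes themselves; only the frames between them are generated.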

The second method is to synthesize facial movements from text or
acoustic speech. A TTS algorithm, or an acoustic speech recognizer,
provides a translation to phonemes, which are then mapped to visemes
(visual phonemes). The visemes drive a speech articulation model that
animates the face. The convincing synthesis of a face from text has yet
to be accomplished. The state of the art provides understandable
acoustic and visual speech and facial expressions.
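
The phoneme-to-viseme step of the second method can be sketched as a many-to-one lookup over timed phonemes. The table, the viseme names, and the timing format below are hypothetical and chosen only for illustration; real systems use larger, tuned tables and model coarticulation between neighboring sounds.

```python
# Hypothetical many-to-one table: several phonemes share one mouth shape.
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_teeth", "v": "lip_teeth",
    "aa": "open", "ae": "open",
    "uw": "rounded", "ow": "rounded",
    "sil": "neutral",
}

def visemes_from_phonemes(timed_phonemes):
    """Map (phoneme, start, end) triples to viseme segments.

    Adjacent segments that share a viseme and touch in time are merged,
    so the articulation model receives one target per mouth shape.
    Unknown phonemes fall back to the neutral viseme.
    """
    segments = []
    for phoneme, start, end in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, "neutral")
        if segments and segments[-1][0] == viseme and segments[-1][2] == start:
            segments[-1] = (viseme, segments[-1][1], end)
        else:
            segments.append((viseme, start, end))
    return segments

# Timings as a TTS engine or speech recognizer might report them (seconds).
phonemes = [("m", 0.00, 0.08), ("aa", 0.08, 0.20),
            ("m", 0.20, 0.28), ("b", 0.28, 0.33)]
track = visemes_from_phonemes(phonemes)
```
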
The third and most recent method for animating a face model is to
measure human facial movements directly and then apply the motion data
to the face model. Facial motions can be captured using one or more
cameras, and the capture can incorporate face markers, structured light,
laser range finders, and other face measurement modes. Each facial
motion capture approach has limitations that might require
postprocessing to overcome. The ideal motion-capture data representation
supports sufficient detail without sacrificing editability (for example,
MPEG-4 Facial Animation Parameters). The choice of modeling and
rendering technologies ranges from 2D line drawings to physics-based 3D
models with muscles, skin, and bone. Of course, textured polygons
(nonuniform rational B-splines and subdivision surfaces) are by far the
most common. A variety of surface deformation schemes exist that attempt
to simulate the natural deformations of the human face while driven by
external parameters.

MPEG-4, which was designed for high-quality visual communication at low
bit rates coupled with low-cost graphics rendering systems, offers one
existing standard for human figure animation. It contains a
comprehensive set of tools for representing and compressing content
objects and the animation of those objects, and it treats virtual humans
(faces and bodies) as a special type of object. The MPEG-4 Face and Body
Animation standard provides anatomically specific locations and
animation parameters. It defines Face Definition Parameter (FDP) feature
points and locates them on the face (see Figure 4). Some of these points
only serve to help define the face’s shape. The rest are displaced by
Facial Animation Parameters (FAPs), which specify feature point
displacements from the neutral face position. Some FAPs are descriptors
for visemes and emotional expressions. Most remaining FAPs are
normalized to be proportional to neutral face mouth width, mouth-nose
distance, eye separation, iris diameter, or eye-nose distance. Although
MPEG-4 has defined a limited set of visemes and facial expressions,
designers can specify two visemes or two expressions with a blend factor
between the visemes and an intensity value for each expression. The
normalization of the FAPs gives the face model designer freedom to
create characters with any facial proportions, regardless of the source
of the FAPs. Designers can embed MPEG-4-compliant face models into
decoders, store them on CD-ROM, download them as an executable from a
Web site, or build them into a Web browser.
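
To see why normalization frees the designer from the proportions of the source face, consider the sketch below. It assumes each normalized parameter is expressed in units of a neutral-face distance divided by a fixed constant; the divisor of 1024, the function names, and the face measurements are assumptions for this example rather than a rendering of the standard’s text.

```python
import math

def fap_units(neutral_face, divisor=1024.0):
    """Derive per-face animation units from neutral-face distances.

    neutral_face maps a distance name to its length in model units.
    The divisor is an assumed constant for this sketch.
    """
    return {name: dist / divisor for name, dist in neutral_face.items()}

def displacement(fap_value, unit_name, units):
    """Convert a normalized parameter value into a model-space offset."""
    return fap_value * units[unit_name]

# Two hypothetical characters; the second has a mouth twice as wide.
narrow = fap_units({"mouth_width": 51.2, "mouth_nose": 25.6})
wide = fap_units({"mouth_width": 102.4, "mouth_nose": 25.6})

# The same transmitted value yields proportionally scaled motion on each.
d_narrow = displacement(200, "mouth_width", narrow)
d_wide = displacement(200, "mouth_width", wide)
```

Because the transmitted value is relative to the receiving face, the same animation stream moves the wide mouth twice as far in absolute terms while covering the same fraction of its width.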

Integration  challenges 

Integrating all the various elements described here into a virtual human
is a daunting task. It is difficult for any single research group to do
it alone. Reusable tools and modular architectures would be an enormous
benefit to virtual human researchers, letting them leverage each other’s
work. Indeed, some research groups have begun to share tools, and
several standards have recently emerged that will further encourage
sharing. However, we must confront several difficult issues before we
can readily plug-and-play different modules to control a virtual human’s
behavior. Two key issues discussed at the workshop were consistency and
timing of behavior.

Consistency 

When combining a variety of behavioral components, one problem is
maintaining consistency between the agent’s internal state (for example,
goals, plans, and emotions) and the various channels of outward behavior
(for example, speech and body movements). When real people present
multiple behavior channels, we interpret them for consistency, honesty,
and sincerity, and for social roles, relationships, power, and
intention. When these channels conflict, the agent might simply look
clumsy or awkward, but it could also appear insincere, confused,
conflicted, emotionally detached, repetitious, or simply fake. To an
actor or an expert animator, this is obvious. Bad actors might fail to
control gestures or facial expressions to portray the demeanor of their
persona in a given situation. The actor might not have internalized the
character’s goals and motivations enough to use the body’s own machinery
to manifest these inner drives as appropriate behaviors. A skilled
animator (and actor) knows that all aspects of a character must be
consistent with its desired mental state because we can control only
voice, body shape, and movement for the final product. We cannot open a
dialog with a pre-animated character to further probe its mind or its
psychological state. With a real-time embodied agent, however, we might
indeed have such an opportunity.

Figure 4. The set of MPEG-4 Face Definition Parameter (FDP) feature points.

One approach to remedying this problem is to explicitly coordinate the
agent’s internal state with the expression of body movements in all
possible channels. For example, Norman Badler’s research group has been
building a system, EMOTE, to parameterize and modulate action
performance. It is based on Laban Movement Analysis, a human movement
observation system. EMOTE is not an action selector per se; it is used
to modify the execution of a given behavior and thus change its movement
qualities or character. EMOTE’s power arises from the relatively small
number of parameters that control or affect a much larger set, and from
new extensions to the original definitions that include non-articulated
face movements. The same set of parameters controls many aspects of
manifest behavior across the agent’s body and therefore permits
experimentation with similar or dissimilar settings. The hypothesis is
that behaviors manifest in separate channels with similar EMOTE
parameters will appear consistent with some internal state of the agent;
conversely, dissimilar EMOTE parameters will convey various negative
impressions of the character’s internal consistency. Most
computer-animated agents provide direct evidence for the latter view:

• Arm gestures without facial expressions look odd.

• Facial expressions with neutral gestures look artificial.

• Arm gestures without torso involvement look insincere.

• Attempts at emotions in gait variations look funny without
concomitant body and facial affect.

• Otherwise carefully timed gestures and speech fail to register with
gesture performance and facial expressions.

• Repetitious actions become irritating because they appear unconcerned
about our changing (more negative) feelings about them.


Timing 

In working together toward a unifying architecture, timing emerged as a
central concern at the workshop. A virtual human’s behavior must unfold
over time, subject to a variety of temporal constraints. For example,
speech-related gestures must closely follow the voice cadence. It became
obvious during the workshop that previous work focused on a specific
aspect of behavior (for example, speech, reactivity, or emotion),
leading to architectures that are tuned to a subset of timing
constraints and cannot straightforwardly incorporate others. During the
final day of the workshop, we struggled with possible architectures that
might address this limitation.

For example, BEAT schedules speech-related body movements using a
pipelined architecture: a text-to-speech system generates a fixed
timeline to which a subsequent gesture scheduler must conform.
Essentially, behavior is a slave to the timing constraints of the speech
synthesis tool. In contrast, systems that try to physically convey a
sense of emotion or personality often work by altering the time course
of gestures. For example, EMOTE works later in the pipeline, taking a
previously generated sequence of gestures and shortening or drawing them
out for emotional effect. Essentially, behavior is a slave to the
constraints of emotional dynamics. Finally, some systems have focused on
making the character highly reactive and embedded in the synthetic
environment. For example, Mr. Bubb of Zoesis Studios (see Figure 5) is
tightly responsive to unpredictable and continuous changes in the
environment (such as mouse movements or bouncing balls). In such
systems, behavior is a slave to environmental dynamics. Clearly, if
these various capabilities are to be combined, we must reconcile these
different approaches.
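
The tension between these timing regimes can be made concrete with a small sketch. The code below is not the BEAT or EMOTE interface; every name, the lead and hold durations, and the timeline format are assumptions for illustration. A first stage places gestures on a fixed speech timeline, and a later stage stretches them for expressive effect without consulting that timeline.

```python
from dataclasses import dataclass

@dataclass
class Gesture:
    name: str
    start: float  # seconds
    end: float    # seconds

def schedule_gestures(word_times, requests, lead=0.2, hold=0.4):
    """Place each requested gesture around the onset of its anchor word.

    word_times: {word: onset_seconds}, the fixed timeline from synthesis.
    requests:   [(gesture_name, anchor_word)].
    """
    scheduled = []
    for name, word in requests:
        onset = word_times[word]
        scheduled.append(Gesture(name, max(0.0, onset - lead), onset + hold))
    return scheduled

def stretch(gesture, factor):
    """Lengthen or shorten a gesture about its start time."""
    duration = gesture.end - gesture.start
    return Gesture(gesture.name, gesture.start, gesture.start + factor * duration)

timeline = {"look": 0.5, "there": 1.0}   # word onsets fixed by the speech stage
plan = schedule_gestures(timeline, [("point", "there")])
slow = stretch(plan[0], 2.0)             # later stage draws the gesture out
```

After stretching, the pointing gesture ends well after the window the scheduler reserved for it, because the later stage has no access to the speech timeline. This is the kind of mismatch that arises when each stage honors only its own constraints.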

One outcome of the workshop was a number of promising proposals for
reconciling these competing constraints. At the very least, much more
information must be shared between components in the pipeline. For
example, if BEAT had more access to timing constraints generated by
EMOTE, it could do a better job of up-front scheduling. Another
possibility would be to specify all of the constraints explicitly and
devise an animation system flexible enough to handle them all, an
approach the motion graph technique suggests. Norman Badler suggests an
interesting pipeline architecture that consists of “fat” pipes with weak
uplinks. Modules would send down considerably more information (and
possibly multiple options) and could poll downstream modules for
relevant information (for example, how long would it take to look at the
ball, given its current location). Exploring these and other
alternatives is an important open problem in virtual human research.
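
One way to read the “fat pipes with weak uplinks” idea is sketched below. The class names, the turn-rate model, and the option labels are all hypothetical; the sketch only shows the two directions of flow described above: an upstream module polls a downstream one for a timing estimate, then sends several options down rather than a single command.

```python
import math

class GazeModule:
    """Downstream animation module that answers timing queries (the uplink)."""

    def __init__(self, max_turn_rate_deg_per_s=180.0):
        self.rate = max_turn_rate_deg_per_s
        self.heading_deg = 0.0

    def estimate_look_time(self, target_xy):
        """Seconds needed to turn from the current heading toward target_xy."""
        target_deg = math.degrees(math.atan2(target_xy[1], target_xy[0]))
        # Smallest absolute angle between the two headings, in [0, 180].
        delta = abs((target_deg - self.heading_deg + 180.0) % 360.0 - 180.0)
        return delta / self.rate

class Planner:
    """Upstream module that polls downstream and sends options down the pipe."""

    def __init__(self, gaze):
        self.gaze = gaze

    def plan_glance(self, ball_xy, time_budget):
        cost = self.gaze.estimate_look_time(ball_xy)
        options = ["eyes_only_glance"]
        if cost <= time_budget:
            # A full head turn fits, so offer it as the preferred option.
            options.insert(0, "look_at_ball")
        return {"options": options, "estimated_cost": cost}

planner = Planner(GazeModule())
relaxed = planner.plan_glance((0.0, 1.0), time_budget=1.0)
rushed = planner.plan_glance((0.0, 1.0), time_budget=0.25)
```

The uplink stays weak in the sense that the planner only asks a question; the downstream module remains free to choose among the options it receives.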


The future of androids remains to be seen, but realistic interactive
virtual humans will almost certainly populate our near future, guiding
us toward opportunities to learn, enjoy, and consume. The move toward
sharable tools and modular architectures will certainly hasten this
progress, and, although significant challenges remain, work is
progressing on multiple fronts. The emergence of animation standards
such as MPEG-4 and H-Anim has already facilitated the modular separation
of animation from behavioral controllers and sparked the development of
higher-level extensions such as the Affective Presentation Markup
Language. Researchers are already sharing behavioral models such as BEAT
and EMOTE. We have outlined only a subset of the many issues that arise,
ignoring many of the more classical AI issues such as perception,
planning, and learning. Nonetheless, we have highlighted the
considerable recent progress towards interactive virtual humans and some
of the key challenges that remain. Assembling a new virtual human is
still a daunting task, but the building blocks are getting bigger and
better every day.


Figure 5. Mr. Bubb is an interactive character developed by Zoesis Studios that reacts
continuously to the user's social interactions during a cooperative play experience. Image
courtesy of Zoesis Studios.